End-Semester Project

Bivariate Analysis between GDP per capita, Sanitation and Life Expectancy across Nations in 2010

Aman Das [BS2206]

Raj Pratap Singh [BS2219]

Shreyansh Mukhopadhyay [BS2147]

Introduction

Overview

Our Aim is to determine which features more directly affect the Life Expectation of the citizens.

\(~\) \(~\)

Data Points

This presentation demonstrates the capabilities of Bivariate Analysis on datasets, to infer relationship between various features of Nations.

  • log of GDP per capita: Logarithm (base \(e\)) of Gross Domestic Product (in $) per citizen. Adjusted for Inflation. [lngdp]
  • Sanitation Access %: Percentage of people using at least basic Sanitation facilities, not shared with other households. [snt]
  • Life Expectancy: The average number of years a newly born child would live, provided current mortality patterns hold. [lfx]

Data

Code
script.dir <- getSrcDirectory(function(x) {x})
# setwd(script.dir)

numerise = function(x){
  x[grepl("k$", x)] <- as.numeric(sub("k$", "", x[grepl("k$", x)]))*10^3
  x <- as.numeric(x)
  return(x)
}

d1_raw = read.csv(file.path(".","Data","gdp.csv"), fileEncoding = 'UTF-8-BOM')
d2_raw = read.csv(file.path(".","Data","sanitation.csv"), fileEncoding = 'UTF-8-BOM')
d3_raw = read.csv(file.path(".","Data","life_expectancy.csv"), fileEncoding = 'UTF-8-BOM')

yearname = "X2010"

d1 = d1_raw[!is.na(numerise(d1_raw[, yearname])),][,c("country", yearname)]
colnames(d1)[2] = "lngdp"
d2 = d2_raw[!is.na(numerise(d2_raw[, yearname])),][,c("country", yearname)]
colnames(d2)[2] = "snt"
d3 = d3_raw[!is.na(numerise(d3_raw[, yearname])),][,c("country", yearname)]
colnames(d3)[2] = "lfx"

dtemp = merge(x = d1, y = d2, by = "country")
d = merge(x = dtemp, y = d3, by = "country")

d$lngdp = log(numerise(d$lngdp))

write.csv(d, "./Data/assembled.csv")

kable(head(d, 6L))
country lngdp snt lfx
Afghanistan 6.265301 34.9 60.5
Albania 8.183118 95.2 78.1
Algeria 8.273847 87.0 74.5
Andorra 10.454495 100.0 81.8
Angola 8.291547 41.1 60.2
Antigua and Barbuda 9.546813 86.3 75.9

Elementary Univariate Analysis

Measures of Central Tendency

Mean or Arithmetic Mean \(\bar{x}\), Median \(\operatorname{median}(x)\) and Mode \(\operatorname{mode}(x)\) are some measures of central tendency in the sample.

Formulae

\[ \begin{aligned} x = \{x_1, x_2, \ldots,x_{n-1}, x_n\} && \operatorname{mean}(x)=\bar{x} = \frac{1}{n} \sum _{i=1}^{n}(x_{i}) \end{aligned} \]\[ \begin{aligned} \operatorname{median}(x)= \frac{x_{\lfloor\frac{n+1}{2}\rfloor}+x_{\lfloor\frac{n+2}{2}\rfloor}}{2} && \operatorname{mode}(x) = x_i \text{ s.t. } \operatorname{Pr}(x_i) = \operatorname{sup}(\operatorname{Pr}(x)) \end{aligned} \]

Note: \(f_i\) is the frequency of the ith observation. \(x_{(i)}\) is the ith largest observation.

Code
getmode <- function(v) {
 uniqv <- unique(v)
 freq = max(tabulate(match(v, uniqv)))
 res = uniqv[which.max(tabulate(match(v, uniqv)))]
 if (freq == 1) res = NULL
 return(res)
}

d_central = data.frame(
  row.names = "Variable",
  Variable = c(
    "*ln(GDP)*",
    "*Sanitation*",
    "*Life Exp.*"
  ),
  Mean = c(
    mean(d$lngdp),
    mean(d$snt),
    mean(d$lfx)
  ),
  Median = c(
    median(d$lngdp),
    median(d$snt),
    median(d$lfx)
  ),
  Mode = c(
    getmode(d$lngdp),
    getmode(d$snt),
    getmode(d$lfx)
  )
)

kable(
  d_central,
  col.names = c(
    "$\\quad \\quad \\bar{x}$",
    "$\\operatorname{median}(x)$",
    "$\\operatorname{mode}(x)$"
  ),
  digits=5
)
\(\quad \quad \bar{x}\) \(\operatorname{median}(x)\) \(\operatorname{mode}(x)\)
ln(GDP) 8.54124 8.48673 9.23014
Sanitation 72.43857 85.60000 100.00000
Life Exp. 70.54603 72.40000 73.20000

\(~\)

  • Notice that mode of Sanitation is 100. Thus a large number of countries have universal access to basic sanitation infrastructure.

Measures of Dispersion

Range \(\operatorname{range}(x)\), Semi-Interquartile Range \(\operatorname{SIR}(x)\), Mean Deviation about x’ \(\operatorname{MD}_{(x')}(x)\), Variance \(s_x^2\), Standard Deviation \(s_x\) are some measures of dispersion in the sample.

Formulae

\[ \begin{aligned} \operatorname{range}(x)=|x_{(n)} - x_{(1)}| && \ Q_1 = \operatorname{median}(x_{(1)}, \ldots ,x_{(\lfloor \frac{n+1}{2} \rfloor)}) && \end{aligned} \] \[ \begin{aligned} \ Q_3 = \operatorname{median}(x_{(\lfloor \frac{n+2}{2} \rfloor)}, \ldots , x_{(n)}) && \operatorname{MD}_{(x')}(x) = \operatorname{mean}(|x_i-x'|) \end{aligned} \] \[ \begin{aligned} \operatorname{SIR}(x)=\frac{|Q_1-Q_3|}{2} && s_x = \sqrt{\operatorname{mean}([x_i - \bar{x}]^2)} && s^2_x= (s_x)^2 \end{aligned} \]

Code
getmd = function(x, center = mean(x)){
  md = mean(
    abs(
      x - rep(center, length(x))
      )
    )
  return(md)
}
d_disp = data.frame(
  row.names = "Variable",
  Variable = c(
    "*ln(GDP)*",
    "*Sanitation*",
    "*Life Exp.*"
  ),
  Range = c(
    max(d$lngdp) - min(d$lngdp),
    max(d$snt) - min(d$snt),
    max(d$lfx) - min(d$lfx)
  ),
  SIR = c(
    IQR(d$lngdp)/2,
    IQR(d$snt)/2,
    IQR(d$lfx)/2
  ),
  MD = c(
    getmd(d$lngdp),
    getmd(d$snt),
    getmd(d$lfx)
  ),
  variance = c(
    (sd(d$lngdp))^2,
    (sd(d$snt))^2,
    (sd(d$lfx))^2
  ),
  SD = c(
    sd(d$lngdp),
    sd(d$snt),
    sd(d$lfx)
  )
)

kable(
  d_disp,
  col.names = c(
    "$\\operatorname{range}(x)$",
    "$\\operatorname{SIR}(x)$",
    "$\\operatorname{MD}_{(\\bar{x})}(x)$",
    "$\\quad \\quad \\quad \\quad s_x^2$",
    "$\\quad \\quad \\quad \\quad s_x$"
  ),
  digits=5
)
\(\operatorname{range}(x)\) \(\operatorname{SIR}(x)\) \(\operatorname{MD}_{(\bar{x})}(x)\) \(\quad \quad \quad \quad s_x^2\) \(\quad \quad \quad \quad s_x\)
ln(GDP) 6.04435 1.06914 1.17229 2.01791 1.42053
Sanitation 94.03000 24.65000 25.50487 872.29346 29.53461
Life Exp. 50.80000 6.00000 6.98712 75.33494 8.67957

\(~\)

  • We can compare \(s_x\) to \(\bar{x}\) and observe that there is a high variation in Sanitation amongst countries.
  • GDP per capita varies drastically across the \(e\)6.044354 range.

Measures of Shape

Coefficients of Skewness \(g_1\) and Kurtosis \(g_2\) describe the symmetry and extremity of tails of the sample distribution.

Formulae

\[ \begin{aligned} m_k = \operatorname{mean}([x-\bar{x}]^k) && g_1 = \frac{m_3}{{m_2}^{\frac{3}{2}}} && g_2 = \frac{m_4}{{m_2}^2} \end{aligned} \]

Code
d_shape = data.frame(
  row.names = "Variable",
  Variable = c(
    "*ln(GDP)*",
    "*Sanitation*",
    "*Life Exp.*"
  ),
  Skewness = c(
    skewness(d$lngdp),
    skewness(d$snt),
    skewness(d$lfx)
  ),
  Kurtosis = c(
    kurtosis(d$lngdp),
    kurtosis(d$snt),
    kurtosis(d$lfx)
  )
)

kable(
  d_shape,
  col.names = c(
    "$g_1$",
    "$g_2$"
  ),
  digits=5
)
\(g_1\) \(g_2\)
ln(GDP) 0.09619 2.14435
Sanitation -0.85989 2.27111
Life Exp. -0.87903 4.02465
  • ln(GDP) is nearly symmetrical, while Sanitation and Life Exp. are highly left-skewed.

    This indicates majority of countries have good sanitation system, and citizen health.

  • ln(GDP) and Sanitation are platykurtic, while Life Exp. is leptokurtic.

Density Plot

Density plots help us in estimating the form, location and spread of the distribution.

Code
labelfunction = function(val1){
  return(list(c(
    "log of GDP per capita",
    "Sanitation Access %",
    "Life Expectancy"
  )))
}
ggplot(stack(d[2:4]), mapping = aes(x = values))+
geom_density(aes(color=ind), linewidth=rel(1.5))+
labs(
  x=NULL,
  y=NULL
  )+
mytheme+
mycolor+
facet_wrap(~ind, scales="free", labeller = labelfunction, ncol=1)+
  theme(legend.position="none",
        strip.text.x = element_text(size = rel(1.5))
  )
  • These plots confirm our previous inferences.

Box Plot

Box plots help us detect potential outliers. They also help us in estimating location and skewness of the distribution.

Code
labelfunction = function(val1){
  return(list(c(
    "log of GDP per capita",
    "Sanitation Access %",
    "Life Expectancy"
  )))
}
ggplot(stack(d[2:4]), mapping = aes(y = values))+
geom_boxplot(aes(fill=ind), alpha=0.6)+
labs(
  x=NULL,
  y=NULL
  )+
mytheme+
mycolor+
facet_wrap(~ind, scales="free", labeller = labelfunction)+
  theme(axis.text.x=element_blank(),
        legend.position="none",
        strip.text.x = element_text(size = rel(1.5)),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank()
  )
  • We observe one potential outlier in the Life Expectancy dataset. It is Haiti at 32.5 years. We observe the yearwise dataset and conclude that it is a Typing Error. We decide to discard Haiti from out analysis.

Inferences

Scatter Plot

A Scatter plot helps us estimate the type of relationship between variables.

Code
sctrplot = function(
    d, x_map, y_map,
    x_lab=waiver(), y_lab=waiver(),
    title=waiver()
){
  plot1 = ggplot(d, mapping = aes(x = x_map, y = y_map))+
    geom_point(
      alpha=0.6
    )+
    mytheme+
    labs(
      x=x_lab,
      y=y_lab,
      title=title
    )
  
  return(plot1)
}

Sanitation vs. GDP per Capita

Life Expectancy vs. GDP per Capita

Life Expectation vs. Sanitation

Inferences

seems like Linear correlation

Bivariate Statistics

Correlation Coefficients

Correlation is any relationship, causal or spurious, between two random variables \(x\), \(y\).

Pearson’s \(r\), Spearman’s \(r_s\), and Kendall’s \(\tau\) are some correlation coefficients. These estimate the linear correlation between two variables.

Code
d_cor = data.frame(
  row.names = "Variable",
  Variable = c(
    "*Sanitation vs. ln(GDP)*",
    "*Life Exp. vs. ln(GDP)*",
    "*Life Exp. vs. Sanitation*"
  ),
  Pearson = c(
    cor(d$snt, d$lngdp, method="pearson"),
    cor(d$lfx, d$lngdp, method="pearson"),
    cor(d$lfx, d$snt, method="pearson")
  ),
  Spearman = c(
    cor(d$snt, d$lngdp, method="spearman"),
    cor(d$lfx, d$lngdp, method="spearman"),
    cor(d$lfx, d$snt, method="spearman")
  ),
  Kendall = c(
    cor(d$snt, d$lngdp, method="kendall"),
    cor(d$lfx, d$lngdp, method="kendall"),
    cor(d$lfx, d$snt, method="kendall")
  )
)

kable(
  d_cor,
  digit = 5,
  col.names = c(
    "*Pearson's* $r$",
    "*Spearman's* $r_s$",
    "*Kendall's* $\\tau$"
  )
)
Pearson’s \(r\) Spearman’s \(r_s\) Kendall’s \(\tau\)
Sanitation vs. ln(GDP) 0.80575 0.85830 0.67394
Life Exp. vs. ln(GDP) 0.79340 0.81563 0.62196
Life Exp. vs. Sanitation 0.82612 0.83337 0.63617

Covariance and Correlation Matrices

Covariance \(\operatorname{cov}(x, y)\) is a measure of the joint variability of two random variables \(x\), \(y\).

Formulae

\[ \begin{aligned} \operatorname {cov} (x,y)={\frac {\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n}} && r_{x,y}= \frac{\operatorname{cov}(x,y)}{s_x s_x} \end{aligned} \]

Code
cov_mat = cov(d[, 2:4])

kable(cov_mat, digits=5)
lngdp snt lfx
lngdp 2.01859 33.68796 9.29249
snt 33.68796 865.95591 200.40379
lfx 9.29249 200.40379 67.95599

\[A_{i,j} = \operatorname{cov}(x_i, x_j)\]

Code
cor_mat = cor(d[, 2:4])

kable(cor_mat, digits=5)
lngdp snt lfx
lngdp 1.00000 0.80575 0.79340
snt 0.80575 1.00000 0.82612
lfx 0.79340 0.82612 1.00000

\[A_{i,j} = r_{x_i, x_j}\]

Inferences

Good linear correlation lets try to observe line of best fit.

Regression

Simple Linear Regression

Simple Univariate Linear Regression is a method for estimating the relationship \(y_i=f(x_i)\) of a response variable \(y\) with a predictor variable \(x\), as a line that closely fits the \(y\) vs. \(x\) scatter plot.

\[ y_i = \hat{a} + \hat{b} x_i + e_i \]

Where \(\hat{a}\) is the intercept, \(\hat{b}\) is the slope, and \(e_i\) is the ith residual error. We aim to minimize \(e_i\) for better fit.

Ordinary Least Squares

Ordinary Least squares method reduces \(e_i\) by minimizing error sum of squares \(\sum{{e_i}^2}\).

Coefficient of Determination \(R^2\) is the proportion of the variation in \(y\) predictable by the model.

Formulae

\[ \begin{aligned} \hat{b} = r\frac{s_y}{s_x} && \hat{a} = \bar{y} - \hat{b}\bar{x} && R^2 = 1 - \frac{\sum{{e_i}^2}}{\sum{(y-\bar{y}})^2} \end{aligned} \]

Code
olssmry = function(
    d, x_map, y_map,
    x_lab=waiver(), y_lab=waiver(),
    title=waiver()
){
  model = lm(formula=y_map~x_map)
  smry = summary(model, signif.stars=TRUE)
  
  smryvec = c(
    as.numeric(model$coefficients["(Intercept)"]),
    as.numeric(model$coefficients["x_map"]),
    smry$r.squared
  )
  
  return(smryvec)
}

olstab = t(data.frame(
  SvG = olssmry(d, d$lngdp, d$snt),
  LvG = olssmry(d, d$lngdp, d$lfx),
  LvS = olssmry(d, d$snt, d$lfx)
))

row.names(olstab) = c(
  "*Sanitation vs. ln(GDP)*",
  "*Life Exp. vs. ln(GDP)*",
  "*Life Exp. vs. Sanitation*"
)

kable(
  olstab,
  digit = 5,
  col.names=c(
  "$\\hat{a}$",
  "$\\hat{b}$",
  "$R^2$"
  )
)
\(\hat{a}\) \(\hat{b}\) \(R^2\)
Sanitation vs. ln(GDP) -69.98571 16.68883 0.64924
Life Exp. vs. ln(GDP) 31.39567 4.60345 0.62949
Life Exp. vs. Sanitation 53.92862 0.23142 0.68248

Least Absolute Deviation

Least absolute Deviation method reduces \(e_i\) by minimizing the sum of absolute deviations \(\sum{|e_i|}\).

Code
ladsmry = function(
    d, x_map, y_map,
    x_lab=waiver(), y_lab=waiver(),
    title=waiver()
){
  model = rq(formula=y_map~x_map)
  smry = summary(model)
  
  smryvec = c(
    as.numeric(model$coefficients[1]),
    as.numeric(model$coefficients[2])
  )
  
  return(smryvec)
}

olstab = t(data.frame(
  SvG = ladsmry(d, d$lngdp, d$snt),
  LvG = ladsmry(d, d$lngdp, d$lfx),
  LvS = ladsmry(d, d$snt, d$lfx)
))

row.names(olstab) = c(
  "*Sanitation vs. ln(GDP)*",
  "*Life Exp. vs. ln(GDP)*",
  "*Life Exp. vs. Sanitation*"
)

kable(
  olstab,
  digit = 5,
  col.names=c(
  "$\\hat{a}$",
  "$\\hat{b}$"
  )
)
\(\hat{a}\) \(\hat{b}\)
Sanitation vs. ln(GDP) -69.02507 16.58740
Life Exp. vs. ln(GDP) 32.10238 4.60164
Life Exp. vs. Sanitation 54.04427 0.23613

Line Fitting

Plotting the estimated Linear Models on the Scatter Plots.

Code
linearplot = function(
    d, x_map, y_map,
    x_lab=waiver(), y_lab=waiver(),
    title=waiver()
){
  olsvec = round(olssmry(d, x_map, y_map), digit=5)
  ladvec = round(ladsmry(d, x_map, y_map), digit=5)
  capstr = TeX(paste("$y_i =", olsvec[1], "+", olsvec[2], "~x_i + e_i$", "\t\t",
                     "$y_i' =", ladvec[1], "+", ladvec[2], "~x_i + e_i'$"))
  
  plot1 = ggplot(d, mapping = aes(x = x_map, y = y_map))+
    geom_point(
      alpha=0.6
    )+
    mytheme+
    labs(
      x=x_lab,
      y=y_lab,
      title=title,
      caption=capstr,
      parse=TRUE
    )+
    geom_smooth(
      method="lm",
      formula=y~x,
      se=FALSE,
      aes(color = "Ordinary Least Squares")
    )+
    geom_smooth(
      method="rq",
      formula=y~x,
      se=FALSE,
      aes(color = "Least Absolute Deviation")
    )+
    labs(
      color="Linear Model"
    )+
    scale_color_manual(
      values=c(
        "#cc241d80",
        "#45858880",
        "#689d6a80",
        "#d65d0e80"
        )
      )+
    theme(
      plot.caption = element_text(hjust=0.5, color="#504945")
    )
  
  return(plot1)
}

Sanitation vs. GDP per Capita

Life Expectancy vs. GDP per Capita

Life Expectancy vs. Sanitation

Inferences

Partial Correlation

Partial Correlation is the relationship between two variables \(x\), \(y\) of interest, after removing effect of some other related variable \(z\).

Formulae

\[ \begin{aligned} &x_i = \hat{a_x} + \hat{b_x} z_i + e_{x,i} && y_i = \hat{a_y} + \hat{b_y} z_i + e_{y,i} \\ &\Rightarrow r_{x,y;z} = r_{e_{x}, e_{y}} \end{aligned} \]

Code
partcor = pcor(d[, 2:4])$estimate

pcortab = data.frame(
  row.names = "Variable",
  Variable = c(
    "*Sanitation vs. ln(GDP)*",
    "*Life Exp.$\\quad$ vs. ln(GDP)*",
    "*Life Exp. vs. Sanitation*"
  ),
  PCor = c(
    partcor[2, 1],
    partcor[3, 1],
    partcor[3, 2]
  )
)

kable(pcortab,
      col.names = c(
        "Partial Correlation"
      ))
Partial Correlation
Sanitation vs. ln(GDP) 0.4382157
Life Exp.\(\quad\) vs. ln(GDP) 0.3828041
Life Exp. vs. Sanitation 0.5182630
  • We observe that the partial correlation between Life Exp. and Sanitation is higher than that of Life. Exp and ln(GDP).

Thus Life Expection is more directly improved by better Sanitaion access, than increase in wealth.

  • We also note that there is fairly high partial correlation between Sanitation and ln(GDP).

This indicates that wealthier countries tend to improve their sanitation infrastructure.

Inferences

Conclusion

Credits